Load necessary libraries.
options(warn=-1)
library(ggplot2)
library(class)
library(FactoMineR)
library(leaflet)
library(data.table)
library(tseries)
library(forecast)
library(RColorBrewer)
library(ggfortify)
library(dplyr)
library(factoextra)
library(plotly)
Resampling methods involve: 1. Repeatedly drawing a sample from the training data. 2. Refitting the model of interest with each new sample. 3. Examining all of the refitted models and then drawing appropriate conclusions.
In leave-one-out cross-validation, given a dataset of N records, we take N-1 records for learning and 1 for validation: we fit the model on the N-1 records and compute the error on the held-out record. We then iterate over all N records, which yields N errors. The final error of the model is the mean of these N single-record errors. Since the process leaves us with a dataset of errors, we can also compute their variance and, for a large enough N, approximate confidence intervals.
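The leave-one-out procedure can be sketched in a few lines of base R. This is a minimal illustration under assumptions that are not part of the original text: it uses the built-in `cars` dataset, a simple linear model, and squared error as the metric. The confidence interval is only a rough normal approximation, since the N errors are not independent.

```r
# Leave-one-out cross-validation for a simple linear model (illustrative data)
data(cars)                      # built-in dataset: speed vs. stopping distance
n <- nrow(cars)
loo_errors <- numeric(n)

for (i in seq_len(n)) {
  fit <- lm(dist ~ speed, data = cars[-i, ])               # learn with N-1 records
  pred <- predict(fit, newdata = cars[i, , drop = FALSE])  # predict the held-out record
  loo_errors[i] <- (cars$dist[i] - pred)^2                 # squared error on that record
}

loo_mean <- mean(loo_errors)                             # final leave-one-out error
loo_se   <- sd(loo_errors) / sqrt(n)                     # standard error of that mean
loo_ci   <- loo_mean + c(-1, 1) * qnorm(0.975) * loo_se  # approximate 95% interval
```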
V-fold cross-validation generalizes leave-one-out: we split the data into V parts (folds) of N/V records each. We take V-1 folds for learning and 1 fold for validation. We then iterate V times, changing which fold is used for validation, and record the validation error of the model for each fold. The final error of the model is the mean of the V validation errors. In that sense, leave-one-out is the particular case of V-fold where V=N.
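A minimal V-fold sketch follows, again assuming the built-in `cars` dataset, a linear model and V=5 (all illustrative choices, not taken from the original analysis). Each record gets a random fold label, and each fold is used once for validation.

```r
# V-fold cross-validation with V = 5 (illustrative data and model)
set.seed(1)
data(cars)
n <- nrow(cars)
V <- 5
fold_id <- sample(rep(seq_len(V), length.out = n))  # random fold label per record

fold_errors <- numeric(V)
for (v in seq_len(V)) {
  train <- cars[fold_id != v, ]   # V-1 folds for learning
  valid <- cars[fold_id == v, ]   # 1 fold for validation
  fit <- lm(dist ~ speed, data = train)
  fold_errors[v] <- mean((valid$dist - predict(fit, newdata = valid))^2)
}

cv_error <- mean(fold_errors)     # final error: mean over the V validation folds
```

Setting `V <- n` in this sketch reduces it to leave-one-out.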
In the bootstrap, we take random samples with replacement from the original dataset to generate an arbitrary number B of bootstrapped datasets (each comprised of a learning part and a validation part, not necessarily of the same size for each bootstrap). The final error of the model is the mean of the errors calculated on the B validation sets generated by the process. It is designed to be used when N is not large enough.
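One common way to obtain the learning and validation parts is to train on the records drawn with replacement and validate on the records that were never drawn (the out-of-bag records). The sketch below assumes that variant, together with the built-in `cars` dataset and B=200; these are illustrative assumptions.

```r
# Bootstrap error estimate using out-of-bag records for validation
set.seed(1)
data(cars)
n <- nrow(cars)
B <- 200
boot_errors <- numeric(B)

for (b in seq_len(B)) {
  idx <- sample(n, n, replace = TRUE)   # learning part: n draws with replacement
  oob <- setdiff(seq_len(n), idx)       # validation part: records never drawn
  fit <- lm(dist ~ speed, data = cars[idx, ])
  boot_errors[b] <- mean((cars$dist[oob] - predict(fit, newdata = cars[oob, ]))^2)
}

boot_error <- mean(boot_errors)         # mean over the B validation sets
```

The size of the validation part changes from one bootstrap sample to the next, which is why the folds are not necessarily of the same size.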
Hyperparameter tuning is the activity of finding the right set of hyperparameters for your model. This search must be guided by some performance metric. Using validation error instead of training error as that metric prevents us from choosing a model that overfits the training data and therefore has not extracted generalizable rules for prediction. We use resampling techniques to calculate validation errors for supervised learning problems, with the advantage that every record is eventually used for training, so no information is lost. Finding the hyperparameters that minimize this error gives the model with the best trade-off between complexity (overfitting) and simplicity (bias).
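As a small sketch of tuning guided by validation error, the code below selects the degree of a polynomial regression by 5-fold cross-validation. The dataset (`cars`), the model family and the grid of candidate degrees are assumptions chosen for illustration only.

```r
# Choose a polynomial degree by minimizing the cross-validated error
set.seed(1)
data(cars)
V <- 5
fold_id <- sample(rep(seq_len(V), length.out = nrow(cars)))
degrees <- 1:4    # candidate hyperparameter values (hypothetical grid)

cv_error <- sapply(degrees, function(d) {
  mean(sapply(seq_len(V), function(v) {
    fit <- lm(dist ~ poly(speed, d), data = cars[fold_id != v, ])
    valid <- cars[fold_id == v, ]
    mean((valid$dist - predict(fit, newdata = valid))^2)
  }))
})

best_degree <- degrees[which.min(cv_error)]  # degree with the lowest validation error
```

The same folds are reused for every candidate, so the candidates are compared on identical splits.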
Selecting the number of clusters for K-means implies testing multiple values of K (with multiple random initializations for each), plotting the criterion J(K) below against K, and looking for the elbow: the value of K after which the curve flattens and adding clusters brings little further gain.
\[ J(K) = \frac{B_K}{Var}\]
Here B_K is the between-cluster variance obtained with K clusters and Var is the total variance.
For hierarchical clustering we have the hierarchical relationship between objects from K=N down to K=1. Since hierarchical clustering defines a distance metric for individuals and for groups, this information helps us define the cutoff points. In the dendrogram, the y-axis (height) shows the value of this distance metric between the clusters being merged. A good cutoff goes across the largest heights, capturing most of the between-group distances.
In both cases, choosing the right K is subject to the data scientist's interpretation of the diagnostic plots, domain knowledge, expertise and objectives.
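Both diagnostics can be computed directly with base R. The sketch below uses the numeric columns of the built-in `iris` dataset as stand-in data (an assumption for illustration), and the largest-gap rule on merge heights is only one possible heuristic for reading a dendrogram.

```r
# Illustrative data: the four numeric columns of the built-in iris dataset
X_toy <- scale(iris[, 1:4])

# K-means: share of between-cluster variance, J(K) = B_K / Var
set.seed(1)
J <- sapply(1:8, function(k) {
  km <- kmeans(X_toy, centers = k, nstart = 10)
  km$betweenss / km$totss
})
plot(J, type = "b", xlab = "K", ylab = "J(K)")

# Hierarchical clustering: look for the largest gap between merge heights
hc <- hclust(dist(X_toy), method = "complete")
heights <- rev(hc$height)             # heights[j]: merge that goes from j+1 to j clusters
gaps <- heights[1:7] - heights[2:8]   # gaps[j]: room to cut the tree into j+1 clusters
K_gap <- which.max(gaps) + 1          # candidate K suggested by the largest gap
```

Neither number should be taken mechanically; they are inputs to the judgment described above.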
v1 <- c(0,0,1,3,3.5,1,3,4)
v2 <- c(1,2,1,1,1,5,4,5)
qplot(v1,v2)
A good clustering mechanism should maximize the between-cluster variance “B” (the groups should be as dissimilar to each other as possible) while minimizing the within-cluster variance “W” (the points within a cluster should be as similar as possible).
Given that the total variance decomposes as
\[ Var = W+B \Rightarrow \frac{B}{W} = \frac{B}{Var - B}\]
the ratio B/W increases with B/Var, so maximizing one is equivalent to maximizing the other.
### Hclustering - Manual
Step 1 - Associate the groups
\[ x_1 \in A; x_2 \in B; x_3 \in C; x_4 \in D; x_5 \in E; x_6 \in F; x_7 \in G; x_8 \in H\]
Step 2 - Calculate distance matrix
\[ d[x_i;x_j] = \sqrt{(x^{var1}_i - x^{var1}_j)^2+ (x^{var2}_i - x^{var2}_j)^2} \]
points = data.frame(cbind(v1, v2))
points$cluster = c('A','B','C','D','E','F','G','H')
points_k8 <- points
dist(points[,-3])
1 2 3 4 5 6 7
2 1.000000
3 1.000000 1.414214
4 3.000000 3.162278 2.000000
5 3.500000 3.640055 2.500000 0.500000
6 4.123106 3.162278 4.000000 4.472136 4.716991
7 4.242641 3.605551 3.605551 3.000000 3.041381 2.236068
8 5.656854 5.000000 5.000000 4.123106 4.031129 3.000000 1.414214
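As a cross-check, the distance matrix can be rebuilt directly from the Euclidean formula and compared with the output of `dist()`. This is a small sketch; the names `manual_d` and `closest` are introduced here for illustration only.

```r
# Rebuild the Euclidean distance matrix from the formula and compare with dist()
v1 <- c(0, 0, 1, 3, 3.5, 1, 3, 4)
v2 <- c(1, 2, 1, 1, 1, 5, 4, 5)

manual_d <- sqrt(outer(v1, v1, "-")^2 + outer(v2, v2, "-")^2)
same <- all.equal(as.matrix(dist(cbind(v1, v2))), manual_d,
                  check.attributes = FALSE)

# Locate the closest pair of points (upper triangle only, to skip the diagonal)
upper <- upper.tri(manual_d)
closest <- which(manual_d == min(manual_d[upper]) & upper, arr.ind = TRUE)
```

`closest` holds the row and column indices of the pair with the smallest distance, which is the first pair to merge.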
Step 3 - Merge x4 and x5 based on distance (0.5)
\[ x_1 \in A; x_2 \in B; x_3 \in C; (x_4,x_5) \in E; x_6 \in F; x_7 \in G; x_8 \in H \]
points[4:5,]$cluster = 'E'
points_k7 <- points
Step 4 - Calculate distance (applying complete linkage to calculate group distance)
#Distance between group E and the rest of the points
max(dist(rbind(points[points$cluster == 'E' ,-3],points[1,-3])))
[1] 3.5
max(dist(rbind(points[points$cluster == 'E',-3], points[2,-3])))
[1] 3.640055
max(dist(rbind(points[points$cluster == 'E',-3], points[3,-3])))
[1] 2.5
max(dist(rbind(points[points$cluster == 'E',-3], points[6,-3])))
[1] 4.716991
max(dist(rbind(points[points$cluster == 'E',-3], points[7,-3])))
[1] 3.041381
max(dist(rbind(points[points$cluster == 'E',-3], points[8,-3])))
[1] 4.123106
#Ungrouped distance
dist(points[points$cluster != 'E',-3])
1 2 3 6 7
2 1.000000
3 1.000000 1.414214
6 4.123106 3.162278 4.000000
7 4.242641 3.605551 3.605551 2.236068
8 5.656854 5.000000 5.000000 3.000000 1.414214
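The `max(dist(rbind(...)))` calls above take the maximum over all pairs in the combined set, including pairs inside the same group. A small helper makes the complete-linkage definition explicit by looking only at pairs that straddle the two groups. The function name `complete_link` and the matrix `pts` are hypothetical and introduced only for this sketch.

```r
# Complete linkage: largest distance between a point of group a and a point of group b
complete_link <- function(a, b) {
  # a, b: numeric matrices with one point per row
  d <- as.matrix(dist(rbind(a, b)))
  rows_a <- seq_len(nrow(a))
  cols_b <- nrow(a) + seq_len(nrow(b))
  max(d[rows_a, cols_b, drop = FALSE])
}

pts <- cbind(v1 = c(0, 0, 1, 3, 3.5, 1, 3, 4),
             v2 = c(1, 2, 1, 1, 1, 5, 4, 5))
group_E <- pts[4:5, , drop = FALSE]

d_E_x1 <- complete_link(group_E, pts[1, , drop = FALSE])  # group E vs. x1
d_E_x6 <- complete_link(group_E, pts[6, , drop = FALSE])  # group E vs. x6
```

For groups built by complete linkage both approaches give the same value, because the within-group distances are never larger than the distance at which the groups merge.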
Step 5 - Merge x1,x2 and x3 based on distance (1)
\[ (x_1,x_2,x_3) \in A; (x_4,x_5) \in E; x_6 \in F; x_7 \in G; x_8 \in H \]
points[1:3,]$cluster = 'A'
points_k5 <- points
Step 6 - Calculate distance (applying complete linkage to calculate group distance)
#Distance between group E and A
max(dist(points[points$cluster %in% c('E','A'),-3]))
[1] 3.640055
#Distance between A and other ungrouped points
max(dist(rbind(points[points$cluster == 'A',-3], points[6,-3])))
[1] 4.123106
max(dist(rbind(points[points$cluster == 'A',-3], points[7,-3])))
[1] 4.242641
max(dist(rbind(points[points$cluster == 'A',-3], points[8,-3])))
[1] 5.656854
#Distance between E and other ungrouped points
max(dist(rbind(points[points$cluster == 'E',-3], points[6,-3])))
[1] 4.716991
max(dist(rbind(points[points$cluster == 'E',-3], points[7,-3])))
[1] 3.041381
max(dist(rbind(points[points$cluster == 'E',-3], points[8,-3])))
[1] 4.123106
#Ungrouped distance
dist(points[!points$cluster %in% c('E','A'),-3])
6 7
7 2.236068
8 3.000000 1.414214
Step 7 - Merge x7 and x8 based on distance (1.414214)
\[ (x_1,x_2,x_3) \in A; (x_4,x_5) \in E; x_6 \in F; (x_7,x_8) \in G \]
points[7:8,]$cluster = 'G'
points_k4 <- points
Step 8 - Calculate distance (applying complete linkage to calculate group distance)
#Distance between each pair of groups
max(dist(points[points$cluster %in% c('A','E'),-3]))
[1] 3.640055
max(dist(points[points$cluster %in% c('A','F'),-3]))
[1] 4.123106
max(dist(points[points$cluster %in% c('A','G'),-3]))
[1] 5.656854
max(dist(points[points$cluster %in% c('E','F'),-3]))
[1] 4.716991
max(dist(points[points$cluster %in% c('E','G'),-3]))
[1] 4.123106
max(dist(points[points$cluster %in% c('F','G'),-3]))
[1] 3
Step 9 - Merge x6 and G based on distance (3)
\[ (x_1,x_2,x_3) \in A; (x_4,x_5) \in E; (x_6,x_7,x_8) \in G \]
points[6,]$cluster = 'G'
points_k3 <- points
Step 10 - Calculate distance (applying complete linkage to calculate group distance)
#Distance between each pair of groups
max(dist(points[points$cluster %in% c('A','E'),-3]))
[1] 3.640055
max(dist(points[points$cluster %in% c('A','G'),-3]))
[1] 5.656854
max(dist(points[points$cluster %in% c('E','G'),-3]))
[1] 4.716991
Step 11 - Merge A and E based on distance (3.640055)
\[ (x_1,x_2,x_3,x_4,x_5) \in E; (x_6,x_7,x_8) \in G \]
points[1:5,]$cluster = 'E'
points_k2 <- points
Step 12 - Calculate distance (applying complete linkage to calculate group distance)
#Distance between group E and G
max(dist(points[points$cluster %in% c('E','G'),-3]))
[1] 5.656854
Step 13 - Merge E and G
\[ (x_1,x_2,x_3,x_4,x_5,x_6,x_7,x_8) \in G\]
Looking at the initial plot, K=3 seems appropriate. Let’s see the clustering that we got:
plot(points_k3$v1,points_k3$v2,pch = 20, col = as.factor(points_k3$cluster))
We compare results with the hclust function:
D <- dist(points[, c("v1", "v2")]) # use only the coordinates; the cluster labels are not numeric
hclust_points_out <- hclust(D,method='complete')
plot(hclust_points_out)
hclust_points_out_k3 <- cutree(hclust_points_out,k=3)
plot(points_k3$v1,points_k3$v2,pch = 20, col = as.factor(hclust_points_out_k3))
It seems that I arrived at the same clustering.
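The agreement can also be checked numerically with a cross-tabulation instead of comparing plots by eye. The sketch below rebuilds the eight points and hard-codes the K=3 labels obtained in the manual procedure; the names `pts`, `manual_k3` and `auto_k3` are introduced for illustration.

```r
# Cross-tabulate the manual K=3 labels against cutree() on hclust()
pts <- data.frame(v1 = c(0, 0, 1, 3, 3.5, 1, 3, 4),
                  v2 = c(1, 2, 1, 1, 1, 5, 4, 5))
manual_k3 <- c("A", "A", "A", "E", "E", "G", "G", "G")  # labels from the manual steps

hc <- hclust(dist(pts), method = "complete")
auto_k3 <- cutree(hc, k = 3)

agreement <- table(manual = manual_k3, hclust = auto_k3)
agreement
```

The two partitions match when every row and every column of the table has exactly one non-zero cell, regardless of how the clusters are named.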
load("C:/Users/MCRIMI/Google Drive/Grad school/DSTI/StatLearningHighDimData/Exam/velib.Rdata")
Setup rownames and columns
#setup rownames and columns
X <- velib$data
colnames(X) <- velib$dates
rownames(X) <- paste(c(1:NROW(velib$names)),velib$names)
Check for NA’s and plot distribution
#Check for NA's
sum(is.na(X))
[1] 0
#Check data
boxplot(X)
In order to have a complete, balanced week, I’ve chosen to remove the first Sunday, which has incomplete data for the day:
X <- X[,-seq_len(13)]
X$means <- rowMeans(X)
We split for posterior dynamics analysis:
weekdays_X <- X[,1:120]
weekend_X <- X[,121:168]
velibmap <- velib$position
velibmap$bonus <- velib$bonus
Looking at the boxplots, we see a clear time series pattern. I’ll take the column means and treat them as a time series object to see the periodicity:
means <- data.frame(colMeans(as.matrix(X)))
tsdata <- ts(means, start=c(1,1),end=c(7,24), frequency=24)
plot(tsdata, ylab= 'Mean available bikes (%)', xlab = 'Hour')
A clear periodic timeseries emerges. We look at the seasons:
ggseasonplot(tsdata, polar = FALSE) +
ggtitle("Seasonal Trends of Data") +
xlab("Hour") +
scale_color_discrete(name= "Day", labels= c("Lun","Mar","Mer","Jeu","Ven","Sam","Dim")) +
theme_bw()
Indeed, we see that the mean occupancy of the stations over that week followed a periodic pattern on weekdays that differs from what we see on the weekend. The weekend seems to be a little smoother, while weekday bike availability has a sharp decline at 9 and at 19, which is to be expected if the bikes are used to go to and come back from work.
We proceed to apply PCA (principal component analysis) to our dataset:
pc <- princomp(X)
By the 90% rule (keep enough components to explain 90% of the variance) we should pick 19 components. Let’s see the other metrics:
#Eigenvalue scree
fviz_eig(pc)
#Cattell's scree test
sdev_diff <- abs(diff(pc$sdev)) # avoid masking the base diff() function
plot(sdev_diff,type='b', main="Cattell's scree test")
abline(h = 0.1*max(sdev_diff),lty=2,col='blue')
Looking at these last two metrics, I feel comfortable using 2 components to capture a significant amount of the dataset variance.
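For reference, the 90% rule mentioned earlier amounts to a cumulative sum over the component variances. The sketch below shows the computation on the built-in `USArrests` dataset rather than on the Vélib data, so the resulting number of components is not the one from this analysis; the variable names are illustrative.

```r
# 90% rule: smallest number of components whose cumulative variance reaches 90%
pc_toy <- princomp(USArrests, cor = TRUE)              # illustrative data only

var_explained <- pc_toy$sdev^2 / sum(pc_toy$sdev^2)   # share of variance per component
cum_var <- cumsum(var_explained)                      # cumulative share
n_comp_90 <- which(cum_var >= 0.90)[1]                # first component reaching 90%
```

Applied to the `pc` object above, the same three lines would work with `pc$sdev` in place of `pc_toy$sdev`.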
We now look at the individuals:
fviz_pca_ind(pc,
col.ind = "cos2", # Color by the quality of representation
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE, # Avoid text overlapping.
label = FALSE
)
Stations seem spread out over Dim.1 and Dim.2, with a skew towards positive Dim.2. I can’t visually identify any clear-cut clusters of individuals based on the main components.
We look at the loadings in more detail:
fviz_pca_var(pc,
col.var = "coord", # Color by coordinates on the PCs
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = FALSE, # Avoid text overlapping
geom.var= c("arrow", "text"),
col.circle= TRUE,
)
Hard to see anything :) - but we can conclude that 1) all loadings have a positive effect on Dim1 and 2) there seems to be a chronological organization. Let’s look at one day in particular:
fviz_pca_var(pc,
title= "PCA - Sunday loadings by the hour",
col.var = "coord", # Color by coordinates on the PCs
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = FALSE, # Avoid text overlapping
geom.var= c("arrow", "text"),
col.circle= TRUE,
select.var = list("name" = colnames(X[,grepl("Dim", colnames(X))]))
)
All of the hours are close in Dim1 but differ greatly in Dim2. Let’s look at a weekday.
fviz_pca_var(pc,
title= "PCA - Wednesday loadings by the hour",
col.var = "coord", # Color by coordinates on the PCs
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = FALSE, # Avoid text overlapping
geom.var= c("arrow", "text"),
col.circle= TRUE,
select.var = list("name" = colnames(X[,grepl("Mer", colnames(X))]))
)
Here we have more differences in Dim1 for Wednesday. It seems like working hours have lower influence in Dim1 and non-working hours have more influence in Dim2. This would be consistent with the Dim1 similarity we’ve seen for Sunday.
fviz_pca_var(pc,
title= "PCA - 09am",
col.var = "coord", # Color by coordinates on the PCs
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = FALSE, # Avoid text overlapping
geom.var= c("arrow", "text"),
col.circle= TRUE,
select.var = list("name" = colnames(X[,grepl("-09", colnames(X))]))
)
We see here a clear difference in both dimensions between weekdays and the weekend. The difference in Dim1 is explained by working/non-working dynamics, but the distance in Dim2 is still not explained.
fviz_pca_var(pc,
title= "PCA-23hs",
col.var = "coord", # Color by coordinates on the PCs
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = FALSE, # Avoid text overlapping
geom.var= c("arrow", "text"),
col.circle= TRUE,
select.var = list("name" = colnames(X[,grepl("-23", colnames(X))]))
)
Looking at 23:00 we see a clear weekend-to-weekday difference in Dim2, while the loadings remain relatively close in Dim1. It seems like Dim2 captures the overall activity.
My interpretation of the loadings is that PC1 captures the variance of stations associated with work-commute dynamics vs. those that are not associated with work dynamics (from left to right). PC2 in turn seems to capture the variance related to the general demand of the stations, from high (+) to low (-).
J = c()
for (k in 1:15){
K_means_out = kmeans(X,centers = k,nstart = 10)
J[k] = K_means_out$betweenss / K_means_out$totss
}
plot(J,type='b')
I’m inclined to choose 4 clusters by looking at the plot.
K <- 4
K_means_out <- kmeans(X,centers = K,nstart = 10)
I then proceed to plot the points within clusters against the city map. Size of the points indicate mean bike availability and the marked points are the stations with bonus.
Setting up the palette:
#plotting
darkcols <- brewer.pal(4, "Dark2")
palette <- colorFactor(darkcols[1:4], domain = NULL)
And plotting the clusters on the map:
#plotting
leaflet(velibmap) %>% addProviderTiles(providers$CartoDB.Positron) %>%
addCircleMarkers(radius = (X$means)*5,
color = palette(K_means_out$cluster),
stroke = ~ifelse(bonus == "1", TRUE, FALSE),
label = ~paste(row.names(X), sep = " - Clus.:", K_means_out$cluster),
fillOpacity = 0.9)
At first look it seems like the K-means clustering was able to find groups by average bike availability. It’s clear that we have a particular cluster for the city center. It also seems to have done a good job of capturing the stations with a bonus feature in a separate cluster.
We now try hierarchical clustering.
X$means <- rowMeans(X)
D = dist(X) # Compute the distance matrix between all observations
hclust_out = hclust(D,method='complete')
plot(hclust_out, labels = FALSE)
Looking at the dendrogram, I also lean towards 4 clusters.
hclust3 <- cutree(hclust_out,k=3)
hclust4 <- cutree(hclust_out,k=4)
leaflet(velibmap) %>% addProviderTiles(providers$CartoDB.Positron) %>%
addCircleMarkers(radius = (X$means)*6,
color = palette(hclust4),
stroke = ~ifelse(bonus == "1", TRUE, FALSE),
label = ~paste(row.names(X), sep = " - Clus.:", hclust4),
fillOpacity = 0.9)
It seems that the hierarchical clustering found very similar groupings to the ones that K-means found.
Looking at the clusters, we seem to have captured groupings by mean demand. Cluster 2 (orange) seems to group the stations at the city center (all along the banks of the Seine). Cluster 4 (pink) seems to capture the stations with lower average availability, and therefore the boroughs where this means of transportation is used the most. This is consistent with the fact that most bonus-flagged stations fall under this cluster. Cluster 3 (violet) seems to capture stations with larger average bike availability.
qplot(X$means, geom="boxplot", factor(hclust4), ylab='cluster', xlab="Cluster mean availability")
We see a clear difference between Cluster 4 and Cluster 3 in their mean bike availability, while Clusters 1 and 2 are pretty similar.
Let’s look at the dynamics to see if we can understand Cluster 2 and Cluster 3.
We look at the weekly averages by the hour:
#grouped by hour
options(warn = -1)
meansByHour <- data.frame(rowMeans(X))
meansByHour$station <- rownames(X)
meansByHour <- cbind(meansByHour, velib$position)
#mean availability per station for each hour of the day ("00" to "23"), in chronological order
for (h in sprintf("%02d", 0:23)) {
meansByHour[[h]] <- rowMeans(X[,grepl(h, colnames(X))])
}
meansByHour$cluster <- hclust4
meansByHour <- meansByHour[,-1]
melted_means_h<-melt(setDT(meansByHour), id.vars = c('station','latitude','longitude', 'cluster'))
cols <- c("variable", "station","cluster")
melted_means_h <- tibble(melted_means_h)
melted_means_h[cols] <- lapply(melted_means_h[cols], as.factor) ## convert the grouping columns to factors
p <- melted_means_h %>%
plot_ly(
x = ~longitude,
y = ~latitude,
size = ~value,
color = ~cluster,
frame = ~variable,
text = ~station,
hoverinfo = 'text',
type = 'scatter',
mode = 'markers'
#fill = ~'',
#marker = list(sizemode = 'diameter')
)
print(p)
Looking at the hourly dynamics, we see a clear trend of bikes coming from Cluster 1 and Cluster 3 to the city center (Cluster 2) during the day and leaving after 18:00. Cluster 3 seems less affected by this dynamic, with larger overall availability. This is even more visible if we only consider weekdays:
weekdays_X$cluster4 <- hclust4
meansByHour <- data.frame(rowMeans(weekdays_X))
meansByHour$station <- rownames(weekdays_X)
meansByHour <- cbind(meansByHour, velib$position)
#mean weekday availability per station for each hour of the day ("00" to "23"), in chronological order
for (h in sprintf("%02d", 0:23)) {
meansByHour[[h]] <- rowMeans(weekdays_X[,grepl(h, colnames(weekdays_X))])
}
meansByHour$cluster <- hclust4
meansByHour <- meansByHour[,-1]
melted_means_h<-melt(setDT(meansByHour), id.vars = c('station','latitude','longitude', 'cluster'))
cols <- c("variable", "station","cluster")
melted_means_h <- tibble(melted_means_h)
melted_means_h[cols] <- lapply(melted_means_h[cols], as.factor) ## convert the grouping columns to factors
p <- melted_means_h %>%
plot_ly(
x = ~longitude,
y = ~latitude,
size = ~value,
color = ~cluster,
frame = ~variable,
text = ~station,
hoverinfo = 'station',
type = 'scatter',
mode = 'markers',
title = "Weekdays dynamic"
)
print(p)
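The per-hour assignments above repeat the same call for every hour. A more compact variant is sketched below on toy data. The column naming scheme (`d<day>_h<HH>`), the toy values, and the helper name `hourly_means` are assumptions made for illustration and are not taken from the real `weekdays_X`. The pattern is anchored to the end of the column name, so an hour label such as "03" cannot match digits elsewhere in the name.

```r
# Toy stand-in for weekdays_X: 3 stations x (2 days * 24 hours).
# Column names follow a hypothetical "d<day>_h<HH>" scheme.
hours <- sprintf("%02d", 0:23)
col_names <- as.vector(outer(hours, 1:2, function(h, d) paste0("d", d, "_h", h)))
set.seed(1)
toy_X <- as.data.frame(matrix(runif(3 * length(col_names)), nrow = 3,
                              dimnames = list(NULL, col_names)))

# Mean over all columns that belong to each hour (hypothetical helper).
hourly_means <- function(X, hours) {
  out <- lapply(hours, function(h) {
    # "_h03$" only matches names that end in the hour label
    cols <- grepl(paste0("_h", h, "$"), colnames(X))
    # drop = FALSE keeps a data frame even if a single column matches
    rowMeans(X[, cols, drop = FALSE])
  })
  names(out) <- hours
  # check.names = FALSE keeps "00", "01", ... as column names
  data.frame(out, check.names = FALSE)
}

toy_means <- hourly_means(toy_X, hours)
dim(toy_means)  # one row per station, one column per hour
```

Whether an anchored pattern is appropriate depends on how the real columns are named, so the pattern would need adapting before use on `weekdays_X`.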
These findings are consistent with what I found to be the main components of the PCA. In my view, the main differences between clusters are explained by two factors. The first is the station's position in the commuting dynamics between the city center and the periphery: for example, Cluster 1 and Cluster 2 have similar mean daily availability, but their hourly distributions of availability are quite different. The second is the overall availability of bikes at the station, most notably the difference in mean availability between Cluster 3 and Cluster 4.
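One way to check this kind of reading is to compare, per cluster, the mean over the whole day with the shape of the hourly profile. The sketch below uses made-up values in the same long format as `melted_means_h` (columns `cluster`, `variable`, `value`). The numbers and the name `toy_long` are illustrative assumptions, not results from the data.

```r
# Made-up long-format data: two clusters with the same daily mean
# but opposite hourly patterns (values are illustrative only).
toy_long <- data.frame(
  cluster  = factor(rep(c(1, 2), each = 4)),
  variable = factor(rep(c("06", "09", "15", "18"), times = 2)),
  value    = c(0.8, 0.2, 0.3, 0.7,   # cluster 1: fuller at 06 and 18
               0.2, 0.8, 0.7, 0.3)   # cluster 2: the reverse
)

# Mean availability per cluster and hour (one row per cluster, one column per hour).
profile <- tapply(toy_long$value, list(toy_long$cluster, toy_long$variable), mean)

# Daily mean per cluster: equal in this toy case, so it cannot separate the clusters.
daily_mean <- rowMeans(profile)

# Correlation between the two hourly profiles: a negative value means opposite rhythms.
profile_cor <- cor(profile["1", ], profile["2", ])
```

In this toy case the daily means coincide while the hourly profiles are perfectly anti-correlated, which is the pattern described above for clusters that differ in rhythm rather than in level.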